Abstract: In a data warehousing process,
the phase of data integration is crucial. Many 
methods for data integration have been published
in the literature. However, with the development
of the Internet, the availability of
various types of data (images, texts, sounds, 
videos, databases...) has increased, and struc- 
turing such data is a difficult task. We name 
these data, which may be structured or un- 
structured, "complex data". In this paper, 
we propose a new approach for complex data 
integration, based on a Multi-Agent System 
(MAS), in association with a data warehousing
approach. Our objective is to take advantage 
of the MAS to perform the integration phase 
for complex data. We indeed consider the dif- 
ferent tasks of the data integration process as 
services offered by agents. To validate this approach,
we have actually developed an MAS
for complex data integration. 

Keywords: Data integration, Complex data,
ETL process, Multi-Agent Systems.



1 Introduction 

The data warehousing and OLAP (On-Line 
Analytical Processing) technologies [Inm96,
Kim96] are now considered mature in management
applications, especially when data
are numerical. With the development of 
the Internet, the availability of various 
types of data (images, texts, sounds, videos, 
databases...) has increased. These data, 
which may be structured or unstructured, 
are called "complex data". Structuring and
exploiting these data is a difficult task and 
requires the use of efficient techniques and 
powerful tools to facilitate their integration 
into a data warehouse. This integration
consists in Extracting and Transforming
complex data before they are Loaded into
the data warehouse (ETL process).

In this paper, we propose a new approach 
for complex data integration, based on a 
Multi-Agent System (MAS). Our approach
consists in physically integrating complex 
data into a relational database (an ODS,
Operational Data Storage) that we consider
as a buffer ahead of the data warehouse. 
We are then interested in extracting, trans- 
forming and loading complex data into the 
ODS. 

The aim of this paper is to take advantage
of Multi-Agent Systems, which are intelligent
programs composed of a set of
agents, each one offering a set of services,
to perform complex data integration. We 
can indeed assimilate the different tasks of 
the integration process, which is technically 
difficult, to services carried out by agents. 
Data extraction: This task is performed 
by an agent in charge of extracting data 
characteristics from complex data. The ob- 
tained characteristics are then transmitted 
to an agent responsible for data structuring. 
Data structuring: To perform this task, an 
agent deals with the organization of data 
according to a well-defined data model. 
Then, this model is transmitted to an agent 
responsible for data storage. 
Data storage: This task is performed by 
an agent that feeds the database with the 
source data, using the model supplied by 
the data structuring agent. 

In order to validate this approach, we 
have designed an MAS for complex data 
integration. This system is composed of a 
set of intelligent agents offering the different 
services that are necessary to achieve the 
integration process of complex data. It is 
based on an evolvable architecture that
offers great flexibility. Our system indeed
allows existing services to be updated and
new agents to be added.



The remainder of this paper is organized 
as follows. Section 2 presents a state of the
art regarding data integration approaches
and agent technology. In Section 3, we
present the issue of complex data integration
and our approach. We explain the advantages
of MASs in Section 4 and show
why they are suited to carrying out this
approach via our proposed architecture.
Finally, we conclude this paper and present
research perspectives in Section 5.

2 State of the art 

We present in this section an overview of the 
techniques our proposal relies on, namely 
those regarding data integration, the ETL 
(Extracting, Transforming, and Loading) 
process, and Multi-Agent Systems.

2.1 Data integration 

Nowadays, two main and opposed ap- 
proaches are used to perform data integra- 
tion over the Web. 

In the mediator-based approach [Rou02],
the different data remain located at their 
original sources. The user is provided with an
abstract view of the data, which represents 
distributed and heterogeneous data as if 
they were stored in a centralized and homogeneous
system. The user's queries are executed
through a mediator-wrapper system
[GLR00]. A mediator reformulates queries
according to the content of the various ac- 
cessible data sources. A wrapper is data 
source-specific, and extracts the selected 
data from the target source. The main advantage
of this approach is its flexibility, since
mediators are able to reformulate and/or
approximate queries to better satisfy the
user. However, when the data sources are
updated, the previous versions of the modified
data are lost, which is not acceptable in a
decision support context where historicity
is important.

Conversely, in the data warehouse
approach [Inm96, Kim96], all the data from
the various data sources are centralized 
in a new database, the data warehouse. 
The multidimensional data model of a data 
warehouse is analysis-oriented: data repre- 
sent indicators (measures) that can be ob- 
served according to axes of analysis (dimen- 
sions). A data warehouse actually charac- 
terizes and is optimized for one given anal- 
ysis context. In a data warehouse context, 
data integration corresponds to the ETL 
process that accesses, cleans and transforms
the heterogeneous data before they
are loaded in the data warehouse. This 
approach supports the dating of data and 
is tailored for analysis. However, refresh- 
ing a data warehouse is a complex and 
time-consuming task that implies running 
a whole ETL process again each time an 
update is required. 

2.2 ETL process 

The classical ETL process, as its name
suggests, proceeds in three steps [Kim96]. The
first extraction phase includes understanding
and reading the data source, and copying
the necessary data into a buffer called
the preparation zone. Then, the second 
transformation phase proceeds in several 
successive steps: clean the data from the 
preparation zone (syntactic errors, domain 
conflicts, etc.); discard some useless data
fields; combine the data sources (by matching
keys, for instance); create new keys
for dimensional records to avoid using keys
that are specific to data sources; and build
aggregates to optimize the most frequent
queries. In this phase, metadata are es- 
sential to store the transformation rules 
and various correspondences. Finally,
the third loading phase stores the prepared 
data into multidimensional structures (data 
warehouse or data marts). It also usually 
includes an indexing phase to optimize later 
accesses. 
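
As an illustration, the three phases can be
sketched in a few lines of Python. This is only
a minimal sketch under stated assumptions: the
source rows, the table names (product_dim,
sales_fact, sales_total) and the cleaning rules
are hypothetical, and an in-memory SQLite
database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical source rows; names and values are illustrative only.
SOURCE_ROWS = [
    {"src_key": "A-17", "product": " Widget ", "amount": "12.5", "comment": "x"},
    {"src_key": "B-03", "product": "widget", "amount": "7.5", "comment": "y"},
    {"src_key": "B-04", "product": "Gadget", "amount": "n/a", "comment": "z"},
]

def extract(rows):
    """Copy the source rows into a preparation zone (here, a plain list)."""
    return [dict(row) for row in rows]

def transform(staged):
    """Clean, discard useless fields, assign surrogate keys, and aggregate."""
    dimension = {}   # product name -> surrogate key
    facts = []
    for row in staged:
        try:
            amount = float(row["amount"])      # reject syntactic errors
        except ValueError:
            continue
        name = row["product"].strip().lower()  # resolve naming conflicts
        key = dimension.setdefault(name, len(dimension) + 1)
        facts.append((key, amount))            # src_key and comment are dropped
    totals = {}
    for key, amount in facts:                  # aggregate per product
        totals[key] = totals.get(key, 0.0) + amount
    return dimension, facts, totals

def load(conn, dimension, facts, totals):
    """Store the prepared data and index it for later accesses."""
    conn.execute("CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE sales_fact (product_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE sales_total (product_id INTEGER PRIMARY KEY, total REAL)")
    conn.executemany("INSERT INTO product_dim VALUES (?, ?)",
                     [(key, name) for name, key in dimension.items()])
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", facts)
    conn.executemany("INSERT INTO sales_total VALUES (?, ?)", list(totals.items()))
    conn.execute("CREATE INDEX idx_fact_product ON sales_fact (product_id)")
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, *transform(extract(SOURCE_ROWS)))
```

The surrogate key product_id replaces the
source-specific src_key, and the sales_total
table plays the role of a precomputed aggregate.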

2.3 Multi-Agent Systems 

A software agent is a classical program
that is qualified as "intelligent". Intelligent
agents are used in many fields such as networks,
on-board technologies, human learning...
An intelligent agent is supposed to
have the following intrinsic characteristics:
intuitive - it must be able to take initiatives
and to complete the actions that are
assigned to it; reactive - it must be aware
of its environment and act accordingly;
sociable - it must be able to communicate
with other agents and/or users [Klu01].
Moreover, agents may be mobile and can in- 
dependently move through an acceptor net- 
work in order to perform various tasks. 

A Multi-Agent System designates a col- 
lection of actors that communicate with 
each other [SZ96]. Each actor is able to
offer specific services and has a well-defined 
goal. This introduces the concept of ser- 
vice: each agent is able to perform several 
tasks, in an autonomous way, and commu- 
nicates the results to a receiving actor (hu- 
man or software). The MAS must respect 
the programming standards defined by the 
FIPA (Foundation for Intelligent Physical 
Agents) [FIP02].
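
The notion of service can be illustrated with
a small Python sketch. It is an illustration
only, not code from an agent platform: the
Message, Agent and Platform classes are
hypothetical, and the message merely borrows a
few field names (performative, sender, receiver,
content) from FIPA-style messages.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    """Minimal message carrying a few FIPA-ACL-style fields."""
    performative: str   # e.g. "request" or "inform"
    sender: str
    receiver: str
    content: object

@dataclass
class Agent:
    name: str
    services: dict = field(default_factory=dict)  # service name -> callable

    def handle(self, msg):
        """Run the requested service and report the result to the requester."""
        service, argument = msg.content
        if msg.performative != "request" or service not in self.services:
            return Message("not-understood", self.name, msg.sender, msg.content)
        result = self.services[service](argument)
        return Message("inform", self.name, msg.sender, result)

class Platform:
    """Toy registry that delivers messages between named agents."""
    def __init__(self):
        self.agents = {}

    def register(self, agent):
        self.agents[agent.name] = agent

    def send(self, msg):
        return self.agents[msg.receiver].handle(msg)

platform = Platform()
platform.register(Agent("counter", {"count_words": lambda text: len(text.split())}))
reply = platform.send(
    Message("request", "user", "counter", ("count_words", "complex data integration")))
```

Adding a service amounts to adding an entry to
an agent's services dictionary, and adding an
agent amounts to one more register call.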

3 MAS-based approach for complex data ETL

3.1 Complex data integration approach

Data integration corresponds to the ETL 
phase in the data warehousing process.
However, the traditional ETL approach is not
suited to the integration of complex data.
We present in this paper our approach
to accomplish the extracting, transforming,
and loading process on complex
data in an original way.

In order to integrate complex data captured
from the Web, for instance, into a
decision support database such as a data
warehouse, we have proposed a full modelling
process (Figure 1). We first designed
a conceptual UML model for a complex
object representing a superclass of
all the types of complex data we consider
(text, multimedia documents, relational
views from databases) [DBB+02a].
The UML conceptual model is then directly 
translated into an XML schema (DTD or 
XML-Schema), which we view as a logical 
model. The last step in our (classical) mod- 
elling process is the production of a physical 
model in the form of XML documents that 
are stored in a relational database. We consider
this database as an ODS (Operational
Data Storage), which is a data repository 
that is typically used in a traditional ETL 
process before the data warehouse proper is 
constituted. However, note that our objec- 
tive is not only to store data, but also to 
truly prepare them for analysis. 




Figure 1: Classical modelling process for
complex data integration



3.2 MAS-based prototype 

The integration of complex data is more dif- 
ficult than a classical ETL process. This 
technically difficult integration process re- 
quires a succession of tasks that we assim- 
ilate to services that may be carried out 
by agents. To effectively achieve this goal, 
we have designed an MAS-based prototype. 
Its architecture is presented in Figure 2. It
is based on a platform of generic agents. 
We have instantiated five agents offering 
services that allow the integration of com- 
plex data. The purpose of this collection 
of agents is to perform several tasks. Each 
agent is able to offer specific services and 
has a well-defined goal. 

The first main agent created in our prototype,
MenuAgent, pilots the system, supervises
agent migrations, and indexes the sites
accessible from the platform. Some other
default pilot agents help in the management
of the agents and provide an interface
for the agent development platform.

The core of the integration process
is achieved through services for collecting,
structuring, generating and storing
data, provided by the remaining agents,
which we present in the next section.

To develop our prototype, we have built a
platform using JADE version 2.61 [JAD02]
and the Java language [Sun02], which is
portable across agent programming platforms.
The prototype is freely available
on-line [BE03].

Figure 2: MAS-based ETL architecture for
complex data integration

3.3 Complex data ETL

3.3.1 Extracting

Recall that our modelling approach corresponds
to the complex data integration process.
The conceptual level helps the user
select the data and establish his/her analysis
goals. The Extraction phase is thus carried
out by the DataAgent agent that collects
the data concerning the documents. This
task consists in extracting the attributes
of the complex object that has been selected
by the user. A particular treatment
is applied, depending on the subdocument
class (image, sound, etc.), since each
subdocument class bears different attributes.
The DataAgent agent uses three ways to
extract the actual data: (1) it communicates
with the user through graphical interfaces,
allowing a manual capture of data;
(2) it uses standard Java methods and packages;
(3) it uses other ad-hoc automatic
extraction algorithms [DBB+02b]. Our objective
is to progressively reduce the number
of manually-captured attributes and to add
new attributes that would be useful for later
analysis and that could be obtained with
data mining techniques. This work is completed
by the WrapperAgent agent that
instantiates the UML structure based on the
data supplied by the DataAgent agent.
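
The class-dependent extraction described in
this section can be sketched as follows. This is
a hedged illustration in Python, not the
DataAgent code (which relies on Java): the
function names, the attribute names (nb_char,
nb_line, title) and the use of the MIME type to
pick the subdocument class are all assumptions.

```python
import mimetypes
import os
import tempfile

def extract_common(path):
    """Attributes shared by every subdocument class."""
    mime, _ = mimetypes.guess_type(path)
    return {"name": os.path.basename(path),
            "size": os.path.getsize(path),
            "mime": mime or "unknown"}

def extract_text(path):
    """Attributes specific to text subdocuments."""
    with open(path, encoding="utf-8") as handle:
        content = handle.read()
    return {"nb_char": len(content), "nb_line": len(content.splitlines())}

# Class-specific extractors; other classes (image, sound, ...) would
# register their own function here.
EXTRACTORS = {"text": extract_text}

def extract(path, manual=None):
    """Merge automatically extracted attributes with manually captured ones."""
    attributes = extract_common(path)
    doc_class = attributes["mime"].split("/")[0]
    attributes["class"] = doc_class
    attributes.update(EXTRACTORS.get(doc_class, lambda p: {})(path))
    attributes.update(manual or {})   # e.g. a title typed in by the user
    return attributes

with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "note.txt")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("complex data\nintegration\n")
    record = extract(sample, manual={"title": "Sample note"})
```

Reducing the number of manually-captured
attributes then means moving entries from the
manual dictionary into a registered extractor.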



3.3.2 Transforming 

The logical level coincides with the Transforming
phase. Our UML conceptual
model is directly translated by the XMLCreator
agent into an XML schema (DTD
or XML-Schema), which we view as a logical
model. XML is the format of choice
for both storing and describing the data.
The schema indeed represents the metadata.
XML is also very interesting because
of its flexibility and extensibility, while
allowing a straightforward mapping into a
conventional database if strong structuring
and retrieval efficiency are needed for
analysis purposes.
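
To make the translation step concrete, the
following Python sketch derives DTD element
declarations from a small class model. It is a
sketch under stated assumptions: the MODEL
dictionary is a hypothetical, much simplified
stand-in for the UML model, and to_dtd is not
the XMLCreator service itself.

```python
# Hypothetical, simplified class model: each class lists its attributes and
# its component classes with a multiplicity ("", "?", "*" or "+").
MODEL = {
    "complex_object": {"attributes": ["name", "date"],
                       "parts": [("subdocument", "+")]},
    "subdocument": {"attributes": ["name", "type", "size"],
                    "parts": []},
}

def to_dtd(model):
    """Emit one ELEMENT declaration per class and one per attribute."""
    lines = []
    declared = set()
    for cls, spec in model.items():
        children = spec["attributes"] + [n + m for n, m in spec["parts"]]
        lines.append("<!ELEMENT %s (%s)>" % (cls, ", ".join(children)))
        for attr in spec["attributes"]:
            if attr not in declared:   # an element type is declared only once
                declared.add(attr)
                lines.append("<!ELEMENT %s (#PCDATA)>" % attr)
    return "\n".join(lines)

dtd = to_dtd(MODEL)
```

Each class becomes an element whose content
model lists its attributes and components, and
each attribute becomes a text-only element.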

3.3.3 Loading 

The last level in our modelling process cor- 
responds to the Loading phase. It consists 
in the production of a physical model in the 
form of XML documents and their loading
into a relational database. This is achieved 
by the XMLCreator and XML2RDBAgent 
agents. The principle of the XMLCre- 
ator agent's service is to parse the XML 
schema recursively, fetching the elements it 
describes, and to write them into the output 
XML document, along with the associated 
values extracted from the original data, on 
the fly. Missing values are currently treated 
by inserting an empty element, but strate- 
gies could be devised to solve this prob- 
lem, either by prompting the user or auto- 
matically. The XML documents obtained 
with the help of the XMLCreator agent are 
mapped into a relational database by the 
XML2RDBAgent agent. It operates in two
steps. First, a DTD parser exploits our log- 
ical model (XML schema) to build a rela- 
tional schema, i.e., a set of tables in which 
any valid XML document (with respect to our
DTD) can be mapped. To achieve this goal,
we mainly used the techniques proposed by
[ABK+00, KKR00]. Note that our DTD
parser is a generic tool: it can operate on 
any DTD. It takes into account all the XML 
element types we need, e.g., elements with 
+, *, or ? multiplicity, element lists, selections,
etc. The last and easiest step consists
in loading the valid XML documents into
the previously built relational structure.
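
The two steps can be sketched in Python as
follows. This is a deliberately reduced
illustration, not the XML2RDBAgent
implementation nor the cited mapping techniques:
the DTD and XML samples are hypothetical, the
parser only handles comma-separated sequences,
and repeated text-only children are ignored.
An in-memory SQLite database stands in for the
relational database.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

DTD = """
<!ELEMENT complex_object (name, date?, subdocument+)>
<!ELEMENT subdocument (name, size)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT size (#PCDATA)>
"""

XML = """
<complex_object>
  <name>report</name>
  <date/>
  <subdocument><name>intro.txt</name><size>120</size></subdocument>
  <subdocument><name>logo.png</name><size>2048</size></subdocument>
</complex_object>
"""

def parse_dtd(dtd):
    """Return {element: [(child, multiplicity), ...]}; [] marks a text element."""
    model = {}
    for name, content in re.findall(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)>", dtd):
        if content.strip() == "#PCDATA":
            model[name] = []
        else:
            model[name] = re.findall(r"(\w+)([?*+]?)", content)
    return model

def is_column(model, child, mult):
    """A text-only child occurring at most once is stored as a column."""
    return not model[child] and mult in ("", "?")

def build_schema(model):
    """Step 1: one table per structured element, linked to its parent row."""
    statements = []
    for element, children in model.items():
        if not children:
            continue
        columns = ["id INTEGER PRIMARY KEY", "parent_id INTEGER"]
        columns += ["%s TEXT" % c for c, m in children if is_column(model, c, m)]
        statements.append("CREATE TABLE %s (%s)" % (element, ", ".join(columns)))
    return statements

def load(conn, model, node, parent_id=None):
    """Step 2: insert one row for node, then recurse into structured children."""
    values = {"parent_id": parent_id}
    for child, mult in model[node.tag]:
        if is_column(model, child, mult):
            values[child] = node.findtext(child)  # None when the element is absent
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        node.tag, ", ".join(values), ", ".join("?" for _ in values))
    cursor = conn.execute(sql, list(values.values()))
    for child in node:
        if model[child.tag]:
            load(conn, model, child, cursor.lastrowid)

model = parse_dtd(DTD)
conn = sqlite3.connect(":memory:")
for statement in build_schema(model):
    conn.execute(statement)
load(conn, model, ET.fromstring(XML))
```

In this sketch, the empty date element is stored
as an empty string, which mirrors the treatment
of missing values described above.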

4 Justification 

The variety of data types (images, texts,
sounds, videos, databases...) increases the
complexity of data. It is thus necessary
to structure them in a "non-classical" way.
Because data are complex, they require
more descriptive information. Furthermore,
it is important to take this information into
account and to represent it in the form of
metadata. The choice of the XML formalism
is thus fully justified. Since our proposal is
based on a classical modelling process, it
allows the user to determine his/her analysis
objectives, and to select how to represent the
data and how to store them into a database.
It constitutes a complete process for
carrying out the integration of complex data,
which is also the objective of the ETL process.

Our proposed process requires several
tasks that must be performed repeatedly.
These tasks are not necessarily sequential,
and are assimilated to services offered by
well-defined agents in a system intended to
achieve such an integration process. With
this goal in mind, we have developed an
MAS-based prototype that relies on a flexible
and evolvable architecture in which we
can update services, and even create new
agents to handle data refreshing, analysis
and so on.

5 Conclusion and Perspectives

In this paper, we have proposed a new approach
for complex data integration based
on both data warehouse technology
and multi-agent systems. This approach
relies on a flexible and evolvable architecture
in which we can add, remove or modify
services, and even create new agents. We
then developed an MAS-based prototype
that performs this integration following
the three steps of the ETL process.
Two agents, named DataAgent and
WrapperAgent, model the input
complex data into UML classes. The
XMLCreator agent translates UML classes
into XML documents that are mapped into a
relational database by the XML2RDBAgent
agent. Moreover, note that the different
agents that compose our system are mobile
and that the services they offer coincide
with the ETL tasks of the data warehousing
process.

We plan to extend the services offered 
by our MAS-based prototype, especially for 
extracting data from their sources and analyzing
them. For example, the DataAgent
agent could converse with on-line search en- 
gines and exploit their answers. On the 
other hand, we could also create new agents 
in charge of modelling data multidimension- 
ally in order to apply analysis methods such 
as OLAP or data mining. 